In [ ]:

    
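# Config manager used below to write the slideshow (livereveal) settings
from traitlets.config.manager import BaseJSONConfigManager
# display and HTML live in the public IPython.display module
from IPython.display import display, HTML
# Wildcard imports keep later cells short, at the cost of a crowded namespace
from numpy import *
from numpy.random import *
from matplotlib.pyplot import figure, show, draw, tight_layout
from numpy import log2, ceil
import sympy
# Render matplotlib figures inline in the notebook
%matplotlib inline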

path = '/home/datasci/.jupyter/nbconfig'
cm = BaseJSONConfigManager(config_dir=path)

cm.update('livereveal',
         {
             'theme': 'serif',
             'start_slideshow_at': 'selected',
             'width': 1280,
             'height': 960,
             'scroll': False,
             'progress': True,
             'controls': True,
             'slideNumber': True
         })

An Introduction to Data Science
using Docker and Jupyter

Author: Josh Cole

Previous Company: General Dynamics

Previous Position: Systems Engineer

University: The University of Bristol

Studied: MEng in Electronics and Communications

Current Role: Software Engineer/Data Science/Big Data

Overview

Why Docker?
Why Jupyter?
Can anyone get involved with Data Science?
What skills do you need?
Big Data: the Four Vs
Learning Resources
A brief play with Docker and Jupyter

Why Docker?

Standard Dev Environment

Configuring a data science environment can be a pain
Dealing with inconsistent package versions
Having to dive through obscure error messages
Waiting hours for packages to compile can be frustrating

Moving to Docker

The above makes it hard to get started with data science in the first place, and is a completely arbitrary barrier to entry
With Docker, we can download an image file that contains a set of packages and data science tools
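
As a rough illustration of the point above, the sketch below assembles a `docker run` command from Python. The image name `jupyter/datascience-notebook`, port 8888 and the mount paths are assumptions made for this example, not settings taken from this talk's stack; substitute whatever your own image documents.

```python
import shlex


def build_docker_command(image, host_port=8888, container_port=8888,
                         host_dir=None, container_dir="/home/jovyan/work"):
    """Return the argument list for a `docker run` call that starts a notebook server."""
    cmd = ["docker", "run", "--rm", "-p", f"{host_port}:{container_port}"]
    if host_dir is not None:
        # Mount a local folder so notebooks survive after the container exits
        cmd += ["-v", f"{host_dir}:{container_dir}"]
    cmd.append(image)
    return cmd


# Assumed image name and host folder, used purely for illustration
command = build_docker_command("jupyter/datascience-notebook",
                               host_dir="/tmp/notebooks")
print(shlex.join(command))

# To actually start the container (requires Docker to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems if a path contains spaces.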

Why Jupyter?

Jupyter excels at literate programming, a style of software development pioneered by Stanford computer scientist Donald Knuth
It lets users formulate and describe their thoughts in prose, supplemented by mathematical equations, as they prepare to write code blocks
Commonly used in:

Demonstrations
Research
Teaching
Collaborative exercises

Supports:

LaTeX equations using MathJax
Markdown cells
Interactive presentations
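
To make the list above concrete, here is a small sketch of how a Markdown cell holding a LaTeX equation is stored in a notebook file (nbformat 4 JSON). The heading, the equation and the `slide_type` value are illustrative choices, not content from this talk; when the cell is rendered, the `$$...$$` span is typeset by MathJax.

```python
import json

# A Markdown cell as stored in an .ipynb file (nbformat 4).
# The heading, equation and slide type are example values only.
markdown_cell = {
    "cell_type": "markdown",
    "metadata": {"slideshow": {"slide_type": "slide"}},
    "source": [
        "## Normal distribution\n",
        "\n",
        "$$f(x) = \\frac{1}{\\sigma\\sqrt{2\\pi}} e^{-\\frac{(x-\\mu)^2}{2\\sigma^2}}$$",
    ],
}

notebook = {
    "cells": [markdown_cell],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

# Serialise to the JSON text that would be written to a .ipynb file
notebook_json = json.dumps(notebook, indent=1)
print(notebook_json)
```

The `slideshow` entry in the cell metadata is what marks the cell as a slide for presentation tools; the prose and the equation are ordinary Markdown source.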

Can anyone get involved with Data Science?

Do you need a PhD in Statistics/Machine Learning?

Many data scientists acquired their quantitative and statistical modeling skills in college, but pursued degrees in business, economics and engineering
They actually know about business problems

Do Data Scientists get their hands dirty?

Data Scientists get their fingernails dirty dumping piles of data in analytical sandboxes, cleansing, and sifting through it for useful patterns that may or may not exist. Then, they do it all over again.

What skills do you need?

Data Science Profile from Doing Data Science by Cathy O'Neil and Rachel Schutt

There is no "I" in "Team": don't go it alone

Big Data: the Four Vs

Volume: data at rest, i.e. the amount of data
Variety: Data in many forms:

Different types of data (e.g. structured, semi-structured and unstructured data)
Different data sources (e.g. internal, external, public)

Velocity: data in motion, i.e. the speed at which data is generated and processed
Veracity: data in doubt i.e. the varying levels of noise and processing errors

Learning Resources

Coursera: high quality courses on Data Science/Machine Learning/Statistics: https://www.coursera.org/
Udacity: similar to Coursera: https://www.udacity.com/
Kaggle: competitions, datasets and tutorials: https://www.kaggle.com/
KDnuggets: datasets, blogs and tutorials: https://www.kdnuggets.com/
Just Google it!

A brief play with Docker and Jupyter

GitHub: https://github.com/JoshCole/DataScience-Stack
A quick tour
